Method and apparatus for selective and power-aware memory error protection and memory management

ABSTRACT

A method for providing selective memory error protection responsive to a predictable failure notification associated with at least one portion of a memory in a computing system includes: obtaining an active error correcting code (ECC) configuration corresponding to the portion of the memory; determining whether the active ECC configuration is sufficient to correct at least one error in the portion of the memory affected by the predictable failure notification; when the active ECC configuration is insufficient to correct the error, determining whether data corruption can be tolerated by an application running on the computing system; when data corruption cannot be tolerated by the application, determining whether a stronger ECC level is available and, if a stronger ECC level is available, increasing a strength of the active ECC configuration; and when data corruption can be tolerated, performing page reassignment and aggregation of non-critical data.

STATEMENT OF GOVERNMENT RIGHTS

This invention was made with Government support under Contract No. B599858 awarded by the Department of Energy. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

The present invention relates generally to the electrical, electronic and computer arts, and, more particularly, to memory error protection and management.

In high-performance computing (HPC), typically two or more servers or computers are connected with high-speed interconnects in an HPC cluster. A cluster consists of several servers networked together that act as a single system, where each server in the cluster performs one or more specific tasks. Each of the individual computers or servers in the cluster may be considered a node. The nodes work together to accomplish an overall objective. As such, subtasks are executed on the nodes in parallel to accomplish the overall objective. However, a failure of any one subtask results in a failure of the entire parallel task.

Uncorrected errors in the main memory (“memory”) of the computer are one of the primary reasons HPC systems crash or fail. For example, uncorrected errors may cause a crash due to an unrecoverable corruption of an operating system of the HPC system or an application running on the HPC system, which then may require the system or application to be restarted. After the crash, sometimes the application may resume from a predefined checkpoint.

A machine check is one way in which system hardware may indicate an internal error. Machine check handlers have been used to signal to the operating system the occurrence of memory parity check errors that are encountered by a memory controller and cannot be corrected by a memory protection mechanism, such as error-correcting codes (ECC), for instance. The memory controller also accounts for corrected and harmless errors. Corrected and harmless errors are errors that do not generate a machine check exception. A machine check exception occurs when an error cannot be corrected by the hardware and in turn signals a machine check handler. Corrected and harmless errors may typically be tracked. Logs of corrected errors and the monitoring of a corrected error count compared to static thresholds have been used in proactive HPC system failure avoidance.

BRIEF SUMMARY

Principles of the invention, in accordance with one or more embodiments thereof, provide techniques for anticipating impacts of imminent memory failure when variable-strength error-correcting codes (ECC) are used, and for performing memory management with adjustable memory error protection based, at least in part, on selective reliability for energy-efficiency improvement.

In one aspect, a method for providing selective memory error protection responsive to a predictable failure notification associated with at least one portion of a memory in a computing system, according to an aspect of the invention, includes the steps of: obtaining an active ECC configuration corresponding to the at least one portion of the memory; determining whether the active ECC configuration is sufficient to correct at least one error in the portion of the memory affected by the predictable failure notification; when the active ECC configuration is insufficient to correct the at least one error, determining whether data corruption can be tolerated by an application running on the computing system; when data corruption cannot be tolerated by the application, determining whether a stronger ECC level is available and, if a stronger ECC level is available, increasing a strength of the active ECC configuration; and when data corruption can be tolerated, performing page reassignment and aggregation of non-critical data.

In another aspect, an apparatus for performing selective memory error protection in a high-performance computing system is provided. The apparatus includes a memory and at least one processor coupled to the memory. The processor, responsive to a predictable failure notification associated with at least one portion of the memory, is configured: to obtain an active ECC configuration corresponding to the at least one portion of the memory; to determine whether the active ECC configuration is sufficient to correct at least one error in the at least one portion of the memory affected by the predictable failure notification; to determine, when the active ECC configuration is insufficient to correct the at least one error, whether data corruption can be tolerated by an application running on the computing system; to determine, when data corruption cannot be tolerated by the application, whether a stronger ECC level is available and, if a stronger ECC level is available, to increase a strength of the active ECC configuration; and, when data corruption can be tolerated, to perform page reassignment and aggregation of non-critical data in the memory.

As used herein, “facilitating” an action includes performing the action, making the action easier, helping to carry the action out, or causing the action to be performed. Thus, by way of example and not limitation, instructions executing on one processor might facilitate an action carried out by instructions executing on a remote processor, by sending appropriate data or commands to cause or aid the action to be performed. For the avoidance of doubt, where an actor facilitates an action by other than performing the action, the action is nevertheless performed by some entity or combination of entities.

One or more embodiments of the invention or elements thereof can be implemented in the form of a computer program product including a computer readable storage medium with computer usable program code for performing the method steps indicated. Furthermore, one or more embodiments of the invention or elements thereof can be implemented in the form of a system (or apparatus) including a memory, and at least one processor that is coupled to the memory and operative to perform exemplary method steps. Yet further, in another aspect, one or more embodiments of the invention or elements thereof can be implemented in the form of means for carrying out one or more of the method steps described herein; the means can include (i) hardware module(s), (ii) software module(s) stored in a computer readable storage medium (or multiple such media) and implemented on a hardware processor, or (iii) a combination of (i) and (ii); any of (i)-(iii) implement the specific techniques set forth herein.

Techniques of the present invention can provide substantial beneficial technical effects. By way of example only and without limitation, one or more embodiments may provide one or more of the following advantages:

-   Provisioning of information about imminent failure to the operating system, enabling proactive adaptation before an uncorrectable error occurs;
-   Support for an application's notification about imminent failure, allowing the application to take proactive fault-handling actions at an application level;
-   Improved system reliability through the anticipation of uncorrectable memory errors when ECC level is dynamically adjusted;
-   Proactive and dynamic coordination of ECC level based on data classification considering tolerance to corruption;
-   Control of memory protection in fine-grained resolution based on data classification; and
-   Enabling power savings by the controlled corruption of non-critical data in cooperation with the application.

These and other features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The following drawings are presented by way of example only and without limitation, wherein like reference numerals (when used) indicate corresponding elements throughout the several views, and wherein:

FIG. 1 is a block diagram depicting details of an exemplary system, according to an embodiment of the invention;

FIG. 2 is a block diagram depicting at least a portion of an exemplary memory health tracking module suitable for use in the exemplary system shown in FIG. 1, according to an embodiment of the invention;

FIG. 3 is a flow diagram depicting at least a portion of an exemplary method for performing memory error protection and management, according to an embodiment of the invention; and

FIG. 4 depicts a computer system that may be useful in implementing one or more aspects and/or elements of the invention.

It is to be appreciated that elements in the figures are illustrated for simplicity and clarity. Common but well-understood elements that may be useful or necessary in a commercially feasible embodiment may not be shown in order to facilitate a less hindered view of the illustrated embodiments.

DETAILED DESCRIPTION

Principles of the present invention will be described herein in the context of illustrative embodiments of a computing system and method for anticipating the impacts of imminent memory failure when variable-strength error-correcting codes (ECC) are employed, and for performing memory management with adjustable memory error protection based on selective reliability for improving energy efficiency. It is to be appreciated, however, that the invention is not limited to the specific apparatus and/or methods illustratively shown and described herein. Rather, aspects of the present disclosure relate more broadly to apparatus and methods for providing selective memory error protection and memory management in a high-performance computing system in a manner which enhances power efficiency. Moreover, it will become apparent to those skilled in the art given the teachings herein that numerous modifications can be made to the embodiments shown that are within the scope of the claimed invention. That is, no limitations with respect to the embodiments shown and described herein are intended or should be inferred.

As previously stated, uncorrected errors in the memory of high performance computers (HPCs) are one of the primary causes of HPC system crashes, where the system or application may need to be restarted. In addition to uncorrected memory parity check and other errors, a memory controller may account for corrected and harmless errors, and store this information in the form of error logs in some storage means. Error logs and the monitoring of a corrected error rate and its comparison to a static threshold have been used to proactively avoid system failure or crashing; the terms “system failure” and “system crash” are used synonymously herein. However, the absolute rate of corrected memory errors, which may be determined from corrected error monitoring, is not a direct indication of a probable future memory failure, since memory failure is typically a dynamic function of one or more characteristics, including, but not limited to, manufacturing variation, surrounding conditions (e.g., temperature, supply voltage, etc.) and workload phases. For example, the shift in a threshold voltage of complementary metal-oxide-semiconductor (CMOS) transistors, which are often used in memory, may vary widely among individual semiconductor chips before they are deployed to the field due to manufacturing process variations, including variations in semiconductor oxide thickness, effective channel length and/or width of semiconductor transistors, and burn-in tests using higher voltages and/or temperatures.

One technique used to indicate early signs of memory health degradation, which can be used to determine the probability of memory failures, is health monitoring (i.e., health tracking). Health monitoring is a technique that relies on capabilities sometimes found in commodity and HPC components that provide sensor information which indicates memory and surrounding conditions. The data obtained by the sensors may be used to detect or indicate degradation processes such as, for example, electromigration (EM), negative bias temperature instability (NBTI), positive bias temperature instability (PBTI), time-dependent dielectric breakdown (TDDB), and hot carrier injection (HCI), among other conditions. A correlation between the information received or obtained by the sensors is used to predict memory failure dynamically at runtime. In particular, when a corrected error rate increases and exceeds (e.g., crosses) a statistically-defined threshold, one or more of the above memory health degradation conditions can be detected. One or more embodiments of the invention utilize notifications based on the reading and interpretation of the sensor data to determine whether a predicted failure is sufficient to exceed an error correction threshold of ECC applied to a specific portion of memory predicted to be affected.

The actual occurrence of uncorrectable memory errors caused by a predicted failure is dependent on the mechanism used for memory error protection, which is often statically defined and uniformly applied to the entire memory. Variable-length ECC mechanisms can be used in the mitigation of the high performance and power overhead associated with traditional memory error detection and correction mechanisms. Transparent and dynamically-adjusted ECC has been proposed as a mechanism to correct memory errors introduced by low-power operation in cache architectures. It can be shown that failure probability is not uniformly distributed in this case and only a small number of cache lines are affected, for which stronger memory protection can be selectively applied. Moreover, transparent and auto-adjusted ECC is used to correct memory errors introduced by refresh rate prolongation for power optimization.

As an alternative to transparent and auto-adjusted ECC approaches, memory architectures with software-adjustable ECC strength have been proposed for memory usage and power optimization (see, e.g., C.-H. Lin, D.-Y. Shen, Y.-J. Chen, C.-L. Yang, and M. Wang, “SECRET: Selective Error Correction for Refresh Energy Reduction in DRAMs,” 2012 IEEE 30th International Conference on Computer Design (ICCD), 2012, pp. 67-74, the disclosure of which is incorporated herein by reference in its entirety). A commercial off-the-shelf (COTS) memory controller that allows the ECC level to be dynamically set by the operating system is described, for example, in Feng Qin et al., “SafeMem: Exploiting ECC-Memory for Detecting Memory Leaks and Memory Corruption During Production Runs,” In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA), 2005, pp. 291-302, the disclosure of which is incorporated herein by reference in its entirety.

Achievable power reduction in auto-adjusted ECC memory systems typically depends on the control over the power state of additional storage and operation of specific logic used to implement the ECC. The additional storage requirement for many existing ECC implementations corresponds to about 12.5 percent of the available memory. However, while these approaches may provide some reduction in power consumption, the lack of information regarding health degradation related to non-uniform aging processes makes memory protected with auto-adjusted ECC potentially more vulnerable. Specifically, weaker ECC can be inadvertently applied to memory areas with predictable failures associated with degradation processes like NBTI or PBTI, for example. Additionally, the lack of awareness of data criticality at the hardware level prevents the exploitation of relaxed error protection in areas of memory holding non-critical data for power optimization.
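By way of illustration only, the figure of about 12.5 percent is consistent with a single-error-correcting, double-error-detecting (SEC-DED) extended Hamming code that stores 8 check bits for every 64 data bits. The text above does not name a particular code, so the choice of SEC-DED and the function names below are assumptions made for this sketch.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a SEC-DED extended Hamming code (assumed code family)."""
    r = 1
    # Single-error correction requires 2**r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One additional overall-parity bit adds double-error detection.
    return r + 1


def storage_overhead(data_bits: int) -> float:
    """Extra storage expressed as a fraction of the data capacity."""
    return secded_check_bits(data_bits) / data_bits


if __name__ == "__main__":
    for width in (64, 128):
        bits = secded_check_bits(width)
        print(f"{width} data bits: {bits} check bits, "
              f"overhead {storage_overhead(width):.2%}")
```

Under these assumptions a 64-bit data word needs 8 check bits, which gives 8/64 = 12.5 percent; wider words amortize the check bits over more data and lower the relative overhead.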

The concept of data classes has been explored in different contexts for improving system resilience and power optimization. Allocation of non-critical data in memory with a lowered refresh rate can be used as a mechanism for power optimization. (See, e.g., Song Liu, et al., “Flikker: Saving DRAM Refresh-power Through Critical Data Partitioning,” In Proceedings of the Sixteenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVI), pp. 213-224, ACM, New York, NY, 2011, the disclosure of which is incorporated by reference herein in its entirety). Non-critical data structures and fault-tolerance of representative applications may also be beneficial to some extent for power optimization. (See, e.g., Song Fu and Cheng-Zong Xu, “Exploring Event Correlation for Failure Prediction in Coalitions of Clusters,” In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, pp. 41:1-41:12, New York, NY, 2007, the disclosure of which is incorporated by reference herein in its entirety). However, each of these mechanisms has limited effectiveness as used in existing solutions.

One or more embodiments of the invention provide a system, method and/or apparatus for anticipating the impacts of imminent memory failure when variable-strength ECC is used, and for performing memory management employing adjustable memory error protection based at least in part on selective reliability for energy efficiency improvement. Energy-efficiency improvement, according to one or more aspects of the invention, is a result of the decision to trade power savings for some potential corruption to selected data, where tolerable in a given application. This concept is referred to herein as “selective reliability.” Note that the use of stronger error correction in a memory generally results in the memory consuming more power. Thus, the invention, in one or more embodiments, facilitates energy-efficiency improvement by allowing the application to determine different levels of reliability depending on the data stored in different parts of the memory and controlling the strength of the ECC used accordingly. A method according to one or more embodiments is implemented as an operating system component, although aspects of the invention are not limited to such an implementation.

Aspects of the present disclosure are applicable to dynamic random access memory (DRAM), although embodiments are not limited to DRAM. For example, one or more embodiments of the invention are applicable to phase-change memory (PCM), resistor memory, and flash memory in solid-state drives (SSDs), among other memory types. The main difference in this case is in the implementation; in the case of flash in SSDs for instance, the implementation could be done at the flash controller level and be potentially transparent to the software. Dealing with data selectively in this case (e.g., indicating some files that can tolerate some corruption) would generally require support in the file system.

In particular, one or more embodiments provide for memory degradation notification at the operating system level of an HPC system for preventive notification of imminent or future memory failure. In this manner, memory management actions can be taken to avert a system failure or crash. One or more embodiments of the invention rely on a combination of health monitoring and corrected memory error monitoring (e.g., corrected errors with ECC, although essentially any variation in the rate of corrected error events produced by a correction mechanism can be used as a first sign of health degradation that triggers a health evaluation) to generate a notification indicating an imminent memory failure. As used herein, an imminent memory failure can be broadly defined as a future memory failure. The timeframe for a failure to occur may be, in one or more embodiments, intrinsically associated with a prescribed prediction accuracy and the particular mechanism using the notification; such timeframe may depend on the application.

In one or more embodiments, techniques of the invention rely, at least in part, on monitoring sensors in the main memory of the HPC system at runtime to dynamically predict the likelihood of a failure occurring in one or more memory blocks at per-die or finer granularity, and generating a notification (e.g., using a signal) indicating an imminent memory failure. One or more embodiments provide inexpensive access to real-time corrected memory error events and reconcile the events with health monitors accessible through monitoring interfaces. In this manner, embodiments of the invention define a correlation between a corrected memory error rate increase, as determined from monitoring of corrected error events, and the likelihood of a memory failure, as determined from health monitoring systems or alternative sensing means. An increase in corrected error rate is considered, in one or more embodiments, a first sign of a potential future memory failure. An increase in corrected error rate can be caused, in some instances, by a faulty memory portion being accessed multiple times or by an evolving degradation process. Other causes of an increase in corrected error rate may apply.

Additionally, by including a hardware-specific correlation function in a hardware-independent failure model implementation, embodiments of the invention help ensure that health monitor readings, notification settings and event notifications are performed in a simplified and extensible way, applicable to a wide variety of scenarios. For example, hardware-independent failure models for a given memory technology can be used in distinct platforms, including distinct architecture and a memory controller. In accordance with one or more embodiments, notification preferences or settings enable controlling a tradeoff between the time (or time period) necessary for taking corrective actions before a failure occurs and the overhead of ensuring against a false positive. Embodiments of the invention provide for early signs of memory failure in proactive operating system mechanisms to be used to avoid system failure.

In some embodiments, a health tracking module, details of which are described below in conjunction with FIGS. 1 and 2, provides a mechanism for notification of predictable failure in a memory or memory segment based on memory health deterioration. The module accesses hardware-specific health indicators (e.g., sensors) and generates hardware-independent notifications of memory health deterioration. Notifications of predictable memory failure, according to one or more embodiments, are based on the reading and interpretation of sensors providing indicators to detect degradation processes such as, for example, EM, NBTI, PBTI, TDDB and HCI, among other conditions, as previously stated.

An ECC strength applied to the specified portion or portions of memory in which a failure is predicted to occur is obtained, such as, for example, from memory controller settings, and used to determine whether the predicted failure can be corrected by the active ECC mechanism (i.e., the ECC mechanism currently being used) or whether an uncorrectable memory error will result. ECC strength, in one or more embodiments, can be measured as a function of one or more parameters of the active ECC mechanism, such as, but not limited to, correcting code length and/or correction capability. According to aspects of the invention, when it is determined (e.g., by the operating system or another system module) that the active ECC mechanism is insufficient to correct the predicted memory failures, the memory controller is operative to invoke a stronger ECC mechanism, if available, or to mitigate power overhead by tolerating data corruption through page reassignment and aggregation of non-critical data indicated by the application in the specified portion(s) of memory.
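The decision sequence described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: it assumes that ECC strength can be reduced to a single integer (the number of correctable bit errors per protected word) and that the failure notification carries a comparable predicted error count. The names `FailureNotification`, `Action` and `choose_action` are hypothetical, and the `UNRESOLVED` outcome is a placeholder for the case in which no stronger ECC level exists.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class Action(Enum):
    NO_CHANGE = "active ECC is sufficient"
    STRENGTHEN_ECC = "increase ECC strength"
    REASSIGN_PAGES = "reassign pages and aggregate non-critical data"
    UNRESOLVED = "no stronger ECC level available"  # placeholder outcome


@dataclass
class FailureNotification:
    portion_id: int            # affected portion of memory (hypothetical field)
    predicted_bit_errors: int  # predicted errors per protected word (hypothetical field)


def choose_action(note: FailureNotification,
                  active_level: int,
                  available_levels: List[int],
                  tolerates_corruption: bool) -> Tuple[Action, int]:
    """Return the selected action and the ECC level to apply afterwards."""
    # Is the active ECC configuration sufficient for the predicted failure?
    if active_level >= note.predicted_bit_errors:
        return Action.NO_CHANGE, active_level
    # Insufficient: can the application tolerate corruption of this data?
    if tolerates_corruption:
        return Action.REASSIGN_PAGES, active_level
    # Corruption is not tolerable: look for a stronger ECC level.
    stronger = sorted(lv for lv in available_levels if lv > active_level)
    if not stronger:
        return Action.UNRESOLVED, active_level
    # Prefer the weakest level that covers the prediction, else the strongest.
    covering = [lv for lv in stronger if lv >= note.predicted_bit_errors]
    return Action.STRENGTHEN_ECC, (covering[0] if covering else stronger[-1])
```

Choosing the weakest level that still covers the predicted errors is a design choice of this sketch, motivated by the observation that stronger ECC generally costs more power.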

In order to further reduce power consumption in the system, the strength of the active ECC mechanism may be lowered (i.e., weakened) in portions of the memory in which an imminent failure is not predicted to occur and in which the stored data can tolerate potential corruption. As previously stated, power consumption in the memory directly correlates with the strength of the ECC mechanism used in the memory, and thus by lowering the strength of the active ECC mechanism in those portions of the memory which contain data that can tolerate corruption, a reduction in power consumption can be achieved, in accordance with one or more embodiments of the invention.
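A minimal sketch of this relaxation policy is given below, assuming each portion of memory is described by its current ECC level, a flag indicating whether a failure is predicted, and a flag indicating whether its data can tolerate corruption. The `Portion` record, the integer ECC levels and the `relax_ecc` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Portion:
    ecc_level: int              # current ECC strength (hypothetical integer scale)
    failure_predicted: bool     # set from a predictable failure notification
    tolerates_corruption: bool  # data classification supplied by the application


def relax_ecc(portions: Dict[int, Portion], weakest_level: int) -> List[int]:
    """Lower the ECC level where it is safe to do so; return the affected ids."""
    relaxed = []
    for portion_id, portion in portions.items():
        safe = not portion.failure_predicted and portion.tolerates_corruption
        if safe and portion.ecc_level > weakest_level:
            portion.ecc_level = weakest_level
            relaxed.append(portion_id)
    return relaxed
```

Portions with a predicted failure, or holding data that cannot tolerate corruption, keep their current ECC level in this sketch.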

With reference now to FIGS. 1 and 2, at least a portion of an exemplary HPC system 100 is depicted in FIG. 1, with an illustrative health tracking module implementation 200 depicted in FIG. 2, according to one or more embodiments of the invention. The exemplary system 100 includes both hardware components (HW) 102 and software components (SW) 104. In some embodiments, an interface 106 is used to facilitate the interaction between at least a portion of the hardware components 102 and the software components 104. The interface 106 may be a low-level interface, such as, for example, an intelligent platform management interface (IPMI) or an IBM remote supervisor adapter (RSA) interface; other suitable interfaces may also be used. The hardware components 102, in this embodiment, include a memory 108, a controller 112, and one or more sensors 110 configured to obtain information (i.e., detect) regarding conditions affecting the memory, or a segment thereof.

As described herein, the term “memory” may be used to indicate the memory as a whole or a particular segment thereof. Additionally, the memory 108 may comprise a standalone memory and/or an embedded memory. The memory 108 may comprise, for example, random access memory (RAM), such as, for instance, DRAM or static random access memory (SRAM). Other suitable memory types may also be used, including, but not limited to, content-addressable memory (CAM), phase change RAM (PCRAM), magnetic RAM (MRAM), etc.

The controller 112 included in the hardware components 102 is coupled with the memory 108 and is configured to control an operation of the memory 108. In one or more embodiments, the controller 112 is operative to vary a strength of the ECC mechanism used to protect data stored in the memory 108, as will be described further below. As shown in FIG. 1, the controller 112 resides externally to the memory 108, although at least a portion of the controller 112, or functionality thereof, may be integrated into the memory 108, according to one or more embodiments.

The memory 108 includes a plurality of memory cells (or blocks of memory cells) 114 divided into multiple memory banks, 116-1, 116-2 through 116-m, where m is an integer representing the number of banks of memory cells. In this embodiment, each of the memory banks 116-1 through 116-m includes n memory cells (or blocks of memory cells), where n is an integer. In one or more other embodiments, the memory banks 116-1 through 116-m may not all have the same number of memory cells; that is, the disclosure contemplates alternative ways in which to organize the memory cells. Each of the memory banks 116-1, 116-2 through 116-m includes a corresponding ECC module, 118-1, 118-2 through 118-m, respectively. Each of the ECC modules 118-1 through 118-m is configured to provide variable error correction functionality to its corresponding memory bank 116-1 through 116-m. The error correction functionality implemented by each of the ECC modules 118-1 through 118-m is independently controlled, in this embodiment, by the memory controller 112. For example, in one or more embodiments, a strength of an ECC mechanism implemented in each of the ECC modules 118-1 through 118-m is selectively controlled by the memory controller 112 as a function of one or more control signals received by the memory controller. A methodology for controlling the error correction functionality of the ECC modules 118-1 through 118-m will be described in further detail herein below.

The sensors 110 preferably reside proximate to the memory 108; at least a portion of the sensors may be incorporated into the memory as shown in FIG. 1, although all or a portion of the sensors 110 may reside externally to the memory. The sensors 110 are operative to obtain information regarding one or more conditions affecting a performance of the memory 108 and/or overall system 100. For example, in one or more embodiments, conditions (or parameters) sensed or detected by the sensors 110 include power, temperature and aging variations; other conditions may be sensed or detected. In one or more embodiments, the sensors 110 comprise performance counters for tracking memory access patterns (i.e., read/write operations). The sensors 110 may be configured to track other memory conditions as well.

In one or more embodiments, the software components 104 include a memory health tracking module 120, an operating system (OS) 122, and an application program interface (API) 124, which includes a runtime component 126 and an application component 128 associated therewith. The memory health tracking module 120, in this embodiment, is adapted to receive one or more signals generated by the sensors 110, either directly or through interface 106, and is operative to generate, as a function of the sensor signals, a failure notification output. The failure notification may be utilized by a component of the operating system (OS) 122, such as a selective reliability manager module 130, and/or by another software component in the system 100.

The selective reliability manager module 130, in one or more embodiments, is configured to communicate with a memory allocation module 132 included in the operating system 122 and an application-level fault management module 140 included in the application 128. The selective reliability manager module 130 also communicates with the memory controller 112. The selective reliability manager module 130, when used in conjunction with the memory controller 112, memory allocation module 132 and the application-level fault management module 140, provides a beneficial mechanism to control the active ECC configuration for effectively balancing memory error protection and power consumption objectives in the system 100. In one or more embodiments, the selective reliability manager module 130 is configured to read an ECC configuration from the memory controller 112, make decisions based on data classifications and/or other characteristics of the data, guide memory allocation based on these decisions, and define ECC configurations to be applied by the memory controller to the different memory banks 116-1 through 116-m, as will be described in further detail below.

FIG. 2 is a block diagram depicting at least a portion of an exemplary memory health tracking module 200, according to an embodiment of the invention. The memory health tracking module 200 represents an illustrative implementation of the memory health tracking module 120 shown in FIG. 1. In this embodiment, the memory health tracking module 200 includes a corrected error rate module 210, which is adapted to receive information from the memory controller 112 regarding corrected error events. The corrected error rate module 210, in some embodiments, is adapted to use this information to determine whether there is an increase in the corrected error rate. The memory health tracking module 200 also includes a failure probability calculation module 212, which is adapted to receive information from one or more hardware components 102, such as from the sensors 110, and/or from one or more software components 104 (FIG. 1), such as one or more failure models or tests 218, and is operative to calculate a failure probability as a function of the received information.

The memory health tracking module 200 further includes a failure probability threshold module 220 which is adapted to receive information from the one or more failure models 218 and notification preferences 214 and is configured to calculate a failure probability threshold as a function of the received information. The notification preferences 214 (e.g., as may be embodied in a notification settings module) can, in one or more embodiments, be controlled by a user. The notification preferences 214 may be related to an action time window for when the imminent failure will occur or to a prediction accuracy of imminent failure.

As is apparent from FIG. 2, the illustrative memory health tracking module 200 further includes a monitoring module 222 operative to monitor one or more aspects of the system 100 (FIG. 1), and when the threshold calculated by the threshold module 220 is exceeded (or otherwise crossed), the memory health tracking module, in some instances, generates one or more signals 224 indicative of an imminent memory failure for the memory 108 (FIG. 1) as a whole, or a particular segment thereof.
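By way of illustration only, the health tracking behavior described above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of modules 210, 220 or 222: the function and class names, the fixed-window rate comparison, and the simple "probability greater than threshold" test are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def corrected_error_rate(event_times: Sequence[float], start: float, end: float) -> float:
    """Corrected-error events per second within the half-open window [start, end)."""
    if end <= start:
        raise ValueError("window must have positive duration")
    count = sum(1 for t in event_times if start <= t < end)
    return count / (end - start)


def rate_increased(event_times: Sequence[float], boundary: float, window: float) -> bool:
    """Compare the window just before `boundary` with the window just after it."""
    before = corrected_error_rate(event_times, boundary - window, boundary)
    after = corrected_error_rate(event_times, boundary, boundary + window)
    return after > before


@dataclass(frozen=True)
class ImminentFailureSignal:
    """Hypothetical stand-in for a signal indicating an imminent memory failure."""
    segment: str
    failure_probability: float
    threshold: float


def monitor(segment: str, failure_probability: float,
            threshold: float) -> Optional[ImminentFailureSignal]:
    """Emit a signal only when the calculated probability exceeds the threshold."""
    if failure_probability > threshold:
        return ImminentFailureSignal(segment, failure_probability, threshold)
    return None
```

In this sketch the threshold is passed in as a plain number; how it would be derived from failure models and notification preferences is left open, since the description does not fix a formula.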

With continued reference to FIG. 1, the memory health tracking module 120 is depicted as residing externally to the operating system 122, although it is to be appreciated that at least a portion of the memory health tracking module may, in some embodiments, be incorporated into the operating system or another module. Additionally, although the memory health tracking module 120 is shown as being implemented entirely as a software component 104, at least a portion of the memory health tracking module may be implemented as a hardware component 102.

In terms of operation, FIG. 3 is a flow diagram depicting at least a portion of an exemplary memory error protection and management methodology 300, according to an embodiment of the invention. With reference to FIGS. 1 and 3, the memory error protection and management methodology 300 begins in step 302 with the operating system 122 receiving a failure notification, such as from the memory health tracking module 120, regarding memory health degradation indicating a predicted imminent failure affecting a prescribed portion of the memory 108. In one or more embodiments, this failure notification is supplied to the selective reliability manager module 130 running on the operating system 122.

In step 304, the current ECC configuration used in the prescribed portion of memory affected by the predicted imminent failure is obtained. The current ECC configuration (e.g., ECC strength), which may be determined as a function of correcting code length and/or correction capability, among other parameters, is obtained from settings of the memory controller 112, in one or more embodiments, although other means for determining the current (i.e., active) ECC configuration are contemplated. Information regarding the current ECC configuration is used, in step 304, to determine whether the predicted failure(s) can be corrected by the active ECC mechanism, or whether an uncorrectable memory error will occur; that is, the current ECC configuration is used to evaluate the impact of the predicted imminent failure on the prescribed portion(s) of the memory 108 (e.g., whether or not the error can be corrected by the current ECC).
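As a rough illustration of this evaluation, the sketch below represents an ECC configuration by its code length and correction capability and compares the latter against a predicted number of bit errors per codeword. The `EccConfig` type, the `can_correct` function, and the notion of a per-codeword error count are assumptions introduced for the example; the description does not specify how the predicted failure is quantified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EccConfig:
    """Hypothetical view of an active ECC configuration."""
    name: str
    code_length_bits: int   # total codeword length, data plus check bits
    correctable_bits: int   # bit errors correctable per codeword


def can_correct(active: EccConfig, predicted_bit_errors: int) -> bool:
    """True when the predicted errors per codeword fit within the correction capability."""
    if predicted_bit_errors < 0:
        raise ValueError("predicted_bit_errors must be non-negative")
    return predicted_bit_errors <= active.correctable_bits


# Illustrative value: a (72,64) SECDED code corrects one bit error per codeword.
SECDED = EccConfig(name="SECDED", code_length_bits=72, correctable_bits=1)
```

With this representation, a predicted single-bit fault is within the capability of `SECDED`, whereas a predicted double-bit fault is not and would lead to the later steps of the methodology.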

The operating system 122 (e.g., the selective reliability manager module 130) determines whether to request that the memory controller activate a stronger ECC level, if available, or to mitigate power overhead related to stronger ECC by tolerating data corruption, such as, for example, through page reassignment and aggregation of noncritical data indicated by the application in the prescribed memory area. Thus, in step 306, a determination is made as to whether the current ECC configuration is sufficient to correct one or more errors in the prescribed portion of memory impacted by the predicted imminent memory failure. If the current ECC configuration is determined to be sufficient to correct errors in the prescribed portion of memory impacted by the predicted imminent memory failure, no change in ECC strength is required and the method 300 returns to step 302 to wait for the next memory failure notification.

Alternatively, if the current ECC configuration is determined in step 306 to be insufficient to correct errors in the prescribed portion of memory impacted by the predicted imminent memory failure, step 308 obtains information, such as from the application 128 (e.g., from application-level fault-management module 140), which is used in determining whether selective data corruption can be tolerated for power optimization. Energy efficiency can be improved when the application elects to accept potential corruption of selected data in exchange for energy savings. Energy savings are obtained by avoiding higher-level ECC, which correlates to higher power consumption in the system.

In step 310, a determination is made as to whether or not data corruption can be tolerated by the application 128. If data corruption can be tolerated (e.g., the application 128 includes fault-tolerant data structures 142), a request is supplied to the memory allocation module 132 running on the operating system 122 to perform, in step 312, page reassignment and aggregation of noncritical data relating to at least the prescribed portion(s) of the memory 108 affected by the predicted imminent memory failure, thereby mitigating the effects of the memory failure. In one or more embodiments, memory mapping is used to identify memory pages and processes expected to be affected. The method 300 then returns to step 302 to wait for the next memory failure notification.

If it is determined in step 310 that data corruption cannot be tolerated, step 314 determines whether or not a stronger ECC level is available. If a stronger ECC level is not available, step 316 notifies the application of a likely imminent unrecoverable error in the prescribed portion of the memory 108. Otherwise, if a stronger ECC level is available, the ECC strength is increased in step 318 in at least the prescribed portion(s) of the memory 108 affected by the predicted imminent memory failure. The method 300 then returns to step 302 to wait for the next memory failure notification.
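The decision portion of methodology 300 (steps 306 through 318) can be summarized in a short sketch. The `Action` enumeration and the `decide` function are hypothetical names chosen for this illustration, and the three boolean inputs stand in for the determinations made in steps 306, 310 and 314; how each determination is actually reached is outside the sketch.

```python
from enum import Enum


class Action(Enum):
    """Possible outcomes of one pass through the flow (hypothetical labels)."""
    NO_CHANGE = "current ECC is sufficient"
    REASSIGN_AND_AGGREGATE = "page reassignment and aggregation (step 312)"
    INCREASE_ECC = "increase ECC strength (step 318)"
    NOTIFY_UNRECOVERABLE = "notify application (step 316)"


def decide(ecc_sufficient: bool, corruption_tolerated: bool,
           stronger_ecc_available: bool) -> Action:
    """Select the action taken after a failure notification is received (step 302)."""
    if ecc_sufficient:               # step 306
        return Action.NO_CHANGE
    if corruption_tolerated:         # step 310
        return Action.REASSIGN_AND_AGGREGATE
    if stronger_ecc_available:       # step 314
        return Action.INCREASE_ECC
    return Action.NOTIFY_UNRECOVERABLE
```

In every branch the method then returns to step 302 to wait for the next notification, so a caller would typically invoke `decide` once per notification inside a loop.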

Aspects of the present invention monitor and adjust ECC schemes dynamically. By way of example only and without limitation or loss of generality, one or more embodiments rely on an interaction with the memory controller to access specific registers indicating the current ECC implementation enabled (i.e., active ECC scheme). An interface accessible by the operating system is used in one or more embodiments. Hardware mechanisms that auto-adjust ECC, while implementing specific and software-transparent methods to determine the ECC level needed for a given scenario, are expected to update read-only status registers indicating the error protection strength enabled. Software-adjusted variable ECC provides an interface for the operating system to set memory controller registers for prescribed error protection levels. The ECC level corresponding to each bank of memory (e.g., 118-1 through 118-m in FIG. 1) may be independently set and may be different relative to one or more other banks of memory, in accordance with one or more embodiments.
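The distinction between read-only status registers (auto-adjusted ECC) and settable registers (software-adjusted ECC) might be modeled as in the following sketch. It is a software model only and does not correspond to any real memory controller interface: the `EccRegisterModel` class, its methods, and the level names used in the usage example are all assumptions made for illustration.

```python
from typing import List, Sequence


class EccRegisterModel:
    """Hypothetical per-bank ECC level registers, ordered weakest to strongest."""

    def __init__(self, levels: Sequence[str], num_banks: int,
                 software_adjustable: bool) -> None:
        if not levels or num_banks <= 0:
            raise ValueError("need at least one ECC level and one bank")
        self._levels: List[str] = list(levels)
        self._active: List[int] = [0] * num_banks   # index into _levels per bank
        self._software_adjustable = software_adjustable

    def read_level(self, bank: int) -> str:
        """Read the status register: always permitted."""
        return self._levels[self._active[bank]]

    def stronger_available(self, bank: int) -> bool:
        """True when a level stronger than the active one exists for this bank."""
        return self._active[bank] < len(self._levels) - 1

    def set_level(self, bank: int, level: str) -> None:
        """Write the control register: only permitted for software-adjusted ECC."""
        if not self._software_adjustable:
            raise PermissionError("ECC level is auto-adjusted; registers are read-only")
        self._active[bank] = self._levels.index(level)
```

For example, `EccRegisterModel(["none", "secded", "chipkill"], 4, True)` would allow bank 2 to be raised to `"chipkill"` while the other banks remain at `"none"`, reflecting the independent per-bank setting described above.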

Combining the notification of imminent memory failure and ECC settings, aspects of the present disclosure determine whether an uncorrectable memory error is predictable and determine the physical address or addresses likely to be affected by the predicted memory failure. Memory mapping is used to identify memory pages and processes expected to be affected. In embodiments wherein the ECC level is auto-adjusted, the operating system 122 can only notify the affected processes about a predicted uncorrectable memory error. Relying on an API 124, the operating system 122 notifies applications 128 about an imminent system crash in case the error is expected to corrupt the operating system, or indicates memory pages that will be corrupted for application-level fault handling.
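The mapping from affected physical addresses to memory pages and processes might look like the following sketch. The 4096-byte page size, the `frame_owners` table (page frame number to owning process identifiers), and both function names are assumptions for illustration; a real operating system would consult its own page tables and reverse mappings.

```python
from typing import Dict, Iterable, List, Set

PAGE_SIZE = 4096  # assumed page size in bytes


def affected_frames(physical_addresses: Iterable[int],
                    page_size: int = PAGE_SIZE) -> Set[int]:
    """Page frame numbers containing any of the given physical addresses."""
    return {addr // page_size for addr in physical_addresses}


def affected_processes(frames: Set[int],
                       frame_owners: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Map each owning process id to the sorted list of its affected frames."""
    result: Dict[int, List[int]] = {}
    for frame in sorted(frames):
        for pid in frame_owners.get(frame, []):
            result.setdefault(pid, []).append(frame)
    return result
```

The resulting per-process frame lists are the kind of information that could then be passed to the affected applications for application-level fault handling.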

In embodiments wherein the memory controller allows the ECC level to be dynamically defined by the operating system, aspects of the present disclosure can be used to selectively manage memory reliability. This approach enables power reduction by using weaker ECC levels for non-critical data that can be at least partially corrupted. Embodiments of the invention apply the concept of data classes and cooperate with the application through the API to indicate non-critical data. Stronger and more costly ECC (in terms of system overhead and/or complexity, for example), in one or more embodiments, is proactively applied to prescribed memory areas holding critical data when an uncorrectable error is predicted to affect these prescribed memory areas; for example, memory areas used by critical operating system data structures and other critical data indicated by the application.

When no stronger ECC is available, a notification is generated (e.g., by the operating system) indicating the expected corruption of critical data due to an uncorrectable memory error. When the uncorrectable error is predicted to impact memory areas holding non-critical data, memory error protection is relaxed, in one or more embodiments, based on a tolerance threshold. With virtual memory support, memory pages including critical application data are reassigned and aggregated for proactive strengthening of memory error protection, according to one or more embodiments of the invention.
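One way to picture reassignment and aggregation by data class is as a list of page moves that group critical pages in one bank and non-critical pages in another. The sketch below assumes a simple two-class labeling ("critical" and anything else), a single target bank per class, and a hypothetical `plan_aggregation` function; none of these details are prescribed by the description.

```python
from typing import Dict, List, Tuple


def plan_aggregation(page_class: Dict[int, str], page_bank: Dict[int, int],
                     critical_bank: int,
                     noncritical_bank: int) -> List[Tuple[int, int, int]]:
    """Return (page, from_bank, to_bank) moves that group pages by data class."""
    moves: List[Tuple[int, int, int]] = []
    for page in sorted(page_class):
        # Critical pages go to the bank that will receive stronger ECC;
        # all other pages go to the bank where protection may be relaxed.
        target = critical_bank if page_class[page] == "critical" else noncritical_bank
        if page_bank[page] != target:
            moves.append((page, page_bank[page], target))
    return moves
```

Once pages are grouped this way, a stronger ECC level need only be enabled for the bank holding critical data, which is the source of the power saving discussed above.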

Given the discussion thus far, it will be appreciated that, in general terms, an exemplary method for providing selective memory error protection responsive to a predictable failure notification associated with at least one portion of a memory in a computing system, according to an aspect of the invention, includes the steps of: obtaining an active ECC configuration corresponding to the at least one portion of the memory; determining whether the active ECC configuration is sufficient to correct at least one error in the portion of the memory affected by the predictable failure notification; when the active ECC configuration is insufficient to correct the at least one error, determining whether data corruption can be tolerated by an application running on the computing system; when data corruption cannot be tolerated by the application, determining whether a stronger ECC level is available and, if a stronger ECC level is available, increasing a strength of the active ECC configuration; and when data corruption can be tolerated, performing page reassignment and aggregation of non-critical data. The present invention further provides, in some embodiments, notifying the application of an imminent unrecoverable error when data corruption cannot be tolerated by the application and a stronger ECC level is not available.

The present invention provides, in one or more embodiments, using memory mapping to identify memory pages and/or processes expected to be affected by a predictable failure notification.

In one or more embodiments, the present invention provides an apparatus for performing selective memory error protection in a high-performance computing system. The apparatus includes a memory and at least one processor coupled with the memory. The processor, responsive to a predictable failure notification associated with at least one portion of the memory, is configured: to obtain an active ECC configuration corresponding to the at least one portion of the memory; to determine whether the active ECC configuration is sufficient to correct at least one error in the at least one portion of the memory affected by the predictable failure notification; to determine, when the active ECC configuration is insufficient to correct the at least one error, whether data corruption can be tolerated by an application running on the computing system; to determine, when data corruption cannot be tolerated by the application, whether a stronger ECC level is available and, if a stronger ECC level is available, to increase a strength of the active ECC configuration; and, when data corruption can be tolerated, to perform page reassignment and aggregation of non-critical data in the memory.

Exemplary System and Article of Manufacture Details

As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, aspects of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.

One or more embodiments of the invention, or elements thereof, can be implemented in the form of an apparatus including a memory and at least one processor that is coupled to the memory and operative to perform exemplary method steps.

One or more embodiments can make use of software running on a general purpose computer or workstation which, when configured according to one or more embodiments of the invention, becomes a special-purpose apparatus. With reference to FIG. 4, such an implementation might employ, for example, a processor 402, a memory 404, and an input/output interface formed, for example, by a display 406 and a keyboard 408. The term “processor” as used herein is intended to include any processing device, such as, for example, one that includes a CPU (central processing unit) and/or other forms of processing circuitry. Further, the term “processor” may refer to more than one individual processor. The term “memory” is intended to include memory associated with a processor or CPU, such as, for example, RAM (random access memory), ROM (read only memory), a fixed memory device (for example, hard drive), a removable memory device (for example, diskette), a flash memory and the like. In addition, the phrase “input/output interface” as used herein is intended to include, for example, one or more mechanisms for inputting data to the processing unit (for example, mouse), and one or more mechanisms for providing results associated with the processing unit (for example, printer). The processor 402, memory 404, and input/output interface such as display 406 and keyboard 408 can be interconnected, for example, via bus 410 as part of a data processing unit 412. Suitable interconnections, for example via bus 410, can also be provided to a network interface 414, such as a network card, which can be provided to interface with a computer network, and to a media interface 416, such as a diskette or CD-ROM drive, which can be provided to interface with media 418.

Accordingly, computer software including instructions or code for performing the methodologies of the invention, as described herein, may be stored in one or more of the associated memory devices (for example, ROM, fixed or removable memory) and, when ready to be utilized, loaded in part or in whole (for example, into RAM) and implemented by a CPU. Such software could include, but is not limited to, firmware, resident software, microcode, and the like.

A data processing system suitable for storing and/or executing program code will include at least one processor 402 coupled directly or indirectly to memory elements 404 through a system bus 410. The memory elements can include local memory employed during actual implementation of the program code, bulk storage, and cache memories which provide temporary storage of at least some program code in order to reduce the number of times code must be retrieved from bulk storage during implementation.

Input/output or I/O devices (including but not limited to keyboards 408, displays 406, pointing devices, and the like) can be coupled to the system either directly (such as via bus 410) or through intervening I/O controllers (omitted for clarity).

Network adapters such as network interface 414 may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks. Modems, cable modems and Ethernet cards are just a few of the currently available types of network adapters.

As used herein, including the claims, a “server” includes a physical data processing system (for example, system 412 as shown in FIG. 4) running a server program. It will be understood that such a physical server may or may not include a display and keyboard.

As noted, aspects of the present invention may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon. Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Media block 418 is a non-limiting example. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any non-transitory medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

It should be noted that any of the methods described herein can include an additional step of providing a system comprising distinct software modules embodied on a non-transitory computer readable storage medium; the modules can include, for example, any or all of the elements depicted in the block diagrams and/or described herein; by way of example and not limitation, a memory health tracking module and a selective reliability manager module. The method steps can then be carried out using the distinct software modules and/or sub-modules of the system, as described above, executing on one or more hardware processors 402. Further, a computer program product can include a non-transitory computer-readable storage medium with code adapted to be implemented to carry out one or more method steps described herein, including the provision of the system with the distinct software modules.

In any case, it should be understood that the components illustrated herein may be implemented in various forms of hardware, software, or combinations thereof; for example, application specific integrated circuits (ASICs), functional circuitry, one or more appropriately programmed general purpose digital computers with associated memory, and the like. Given the teachings of the invention provided herein, one of ordinary skill in the related art will be able to contemplate other implementations of the components of the invention.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms “a,” “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprises” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description of the present invention has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the invention in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. The embodiment was chosen and described in order to best explain the principles of the invention and the practical application, and to enable others of ordinary skill in the art to understand the invention for various embodiments with various modifications as are suited to the particular use contemplated.

What is claimed is:
 1. A method for providing selective memory error protection responsive to a predictable failure notification associated with at least one portion of a memory in a computing system, the method comprising: obtaining an active error correcting code (ECC) configuration corresponding to the at least one portion of the memory; determining whether the active ECC configuration is sufficient to correct at least one error in the at least one portion of the memory affected by the predictable failure notification; when the active ECC configuration is insufficient to correct the at least one error, determining whether data corruption can be tolerated by an application running on the computing system; when data corruption cannot be tolerated by the application, determining whether a stronger ECC level is available and, if a stronger ECC level is available, increasing a strength of the active ECC configuration; and when data corruption can be tolerated, performing page reassignment and aggregation of non-critical data.
 2. The method of claim 1, further comprising generating a failure notification notifying the application of an imminent unrecoverable error when data corruption cannot be tolerated by the application and a stronger ECC level is not available.
 3. The method of claim 2, wherein generating the failure notification notifying the application of an imminent unrecoverable error comprises monitoring a variation in a number of corrected error events produced by an ECC mechanism currently being used for the at least one portion of the memory.
 4. The method of claim 1, wherein the step of performing page reassignment and aggregation of non-critical data comprises performing memory mapping to identify memory pages and processes expected to be affected by the predictable failure notification.
 5. The method of claim 1, wherein obtaining the active ECC configuration comprises receiving, from a memory controller in the computing system, at least one parameter regarding an ECC mechanism currently being used for the at least one portion of the memory affected by the predictable failure notification.
 6. The method of claim 5, wherein the at least one parameter comprises at least one of correcting code length and correction capability corresponding to the ECC mechanism currently being used for the at least one portion of the memory.
 7. The method of claim 1, further comprising: receiving at least one sensor signal generated by at least one corresponding sensor configured to monitor a health status of the at least one portion of the memory; and generating the failure notification indicative of a predictable failure of the at least one portion of the memory as a function of the at least one sensor signal.
 8. The method of claim 1, wherein increasing the strength of the active ECC configuration comprises increasing at least one of a correcting code length and a correction capability corresponding to an ECC mechanism currently being used.
 9. The method of claim 1, further comprising classifying data used by the application based on a tolerance of the data to corruption, wherein a resolution of the selective memory error protection is controlled as a function of a classification of the data.
 10. The method of claim 9, wherein determining whether data corruption can be tolerated by the application running on the computing system is performed by an application program interface (API) of the application indicating non-critical data based on the classification of the data.
 11. The method of claim 9, further comprising partitioning the memory into a plurality of blocks and allocating data to the blocks as a function of the classification of the data.
 12. The method of claim 11, further comprising independently controlling an error correcting functionality implemented by each of the plurality of blocks of memory.
 13. The method of claim 11, further comprising assigning at least two different ECC levels implemented by corresponding ECC mechanisms in at least two different blocks of memory.
 14. An apparatus for performing selective memory error protection in a high-performance computing system, the apparatus comprising: a memory; and at least one processor coupled to the memory, the processor, responsive to a predictable failure notification associated with at least one portion of the memory, being configured: to obtain an active error correcting code (ECC) configuration corresponding to the at least one portion of the memory; to determine whether the active ECC configuration is sufficient to correct at least one error in the at least one portion of the memory affected by the predictable failure notification; to determine, when the active ECC configuration is insufficient to correct the at least one error, whether data corruption can be tolerated by an application running on the computing system; to determine, when data corruption cannot be tolerated by the application, whether a stronger ECC level is available and, if a stronger ECC level is available, to increase a strength of the active ECC configuration; and, when data corruption can be tolerated, to perform page reassignment and aggregation of non-critical data in the memory.
 15. The apparatus of claim 14, wherein the processor is further configured to generate a failure notification notifying the application running on the computing system of an imminent unrecoverable error when data corruption cannot be tolerated by the application and a stronger ECC level is not available.
 16. The apparatus of claim 14, wherein the processor is further configured to monitor a variation in a number of corrected error events produced by an ECC mechanism used in the at least one portion of the memory.
 17. The apparatus of claim 14, wherein the processor is further configured to receive at least one parameter regarding an ECC mechanism currently being used for the at least one portion of the memory affected by the predictable failure notification to thereby obtain the active ECC configuration corresponding to the at least one portion of the memory.
 18. The apparatus of claim 14, further comprising a memory health tracking module configured to receive at least one sensor signal generated by at least one corresponding sensor configured to monitor a health status of the at least one portion of the memory and to generate a failure notification indicative of a predictable failure of the at least one portion of the memory as a function of the at least one sensor signal.
 19. The apparatus of claim 14, wherein the memory comprises a plurality of memory banks and a plurality of ECC modules, each of the ECC modules being operatively coupled with a corresponding one of the memory banks, the processor being configured to classify data used by the application based on a tolerance of the data to corruption and to allocate the data to the memory banks as a function of a classification of the data.
 20. The apparatus of claim 14, wherein the processor comprises a memory controller coupled with the memory and configured to vary a strength of an ECC mechanism used to protect data stored in the memory.